Wu et al. BMC Bioinformatics 2014, 15(Suppl 2):S9 
http://www.biomedcentral.com/1471-2105/15/S2/S9



Bioinformatics 



PROCEEDINGS Open Access 



Collective prediction of protein functions from 
protein-protein interaction networks 

Qingyao Wu 1,2 , Yunming Ye 1,2 *, Michael K Ng 3 , Shen-Shyang Ho 4 , Ruichao Shi 1,2 

From The Twelfth Asia Pacific Bioinformatics Conference (APBC 2014) 
Shanghai, China. 17-19 January 2014 



Abstract 

Background: Automated assignment of functions to unknown proteins is one of the most important tasks in
computational biology. The development of experimental methods for genome-scale analysis of molecular
interaction networks offers new ways to infer protein function from protein-protein interaction (PPI) network data.
Existing techniques for collective classification (CC) usually increase accuracy for network data, wherein instances are
interlinked with each other, by using a large amount of labeled data for training. However, labeled data are
time-consuming and expensive to obtain, whereas a large amount of unlabeled data can be obtained easily.
Thus, more sophisticated methods are needed to exploit the unlabeled data to increase prediction accuracy for
protein function prediction.

Results: In this paper, we propose an effective Markov chain based CC algorithm (ICAM) to tackle the label 
deficiency problem in CC for interrelated proteins from PPI networks. Our idea is to model the problem using two 
distinct Markov chain classifiers to make separate predictions with regard to attribute features from protein data 
and relational features from relational information. The ICAM learning algorithm combines the results of the two 
classifiers to compute the ranks of labels to indicate the importance of a set of labels to an instance, and uses an 
ICA framework to iteratively refine the learning models for improving performance of protein function prediction 
from PPI networks in the paucity of labeled data. 

Conclusion: Experimental results on the real-world Yeast protein-protein interaction datasets show that our 
proposed ICAM method is better than the other ICA-type methods given limited labeled training data. This 
approach can serve as a valuable tool for the study of protein function prediction from PPI networks. 



Background 

We have witnessed a revolution in sequencing technologies in the last decade. The biological sciences are
undergoing an explosion in the amount of genome sequences. There is increasing interest in using computational
methods to identify the biological functions of protein sequences [1], as experimentally determining
protein functions is time-consuming and cannot keep up with the fast growth of newly found proteins [2].

Various studies have applied machine learning methods to protein data from biological experiments to predict the
functions of unknown proteins (e.g. [3,4]). Classical computational approaches for protein function prediction
represent each protein as a set of features, and employ machine learning algorithms to automatically
predict the protein function based on these features. The most well-established methods [5] are the BLAST [6]
approach based on sequence, PROSITE [7] based on sequence motifs, and PFAM [8] based on profile methods.

In recent years, the development of experimental methods for genome-scale analysis of molecular interaction
networks has offered new ways to infer protein function in the context of a protein-protein interaction (PPI)
network, wherein proteins and detected PPIs are represented by nodes and edges, respectively. The basic idea is
that the direct interaction partners of a protein are likely to share similar biological functions [9]. Assignment of
protein functions using PPI data has also been extensively studied, for example with neighborhood counting based
methods [10], graph theoretic methods [11], hierarchical clustering-based methods [12] and graph clustering
methods [13]. Although many efforts have been made in protein function prediction, most of them are based either
on sequence similarity, which ignores the protein interactions, or on PPI information without using attributes
derived from the content of the protein sequence. The former often fails if a query protein has little or no
sequence similarity to any protein with known labels; the latter has a similar problem if there is insufficient
relevant PPI information.

To explicitly use both the content of the data and the link information of the PPI network to improve prediction
performance, collective classification (CC) has been proposed. It has received considerable attention in the last
decade. Various CC algorithms have been proposed in the literature [14], such as the iterative classification
algorithm (ICA) [15], Gibbs sampling (Gibbs) [16], and variants of the weighted-vote relational neighbor
algorithm (wvRN) [17]. Here, we focus on ICA-type approaches, which use a local classifier, such as kNN, to infer
the class labels of related instances. The key idea is to construct new relational feature vectors by summarizing
the label information from neighboring nodes, and then use the relational features together with the attribute
features derived from the content of the data to learn local classifiers for prediction.

Figure 1 illustrates how ICA proceeds. In Figure 1(a), an attribute-only classifier M_A, induced using only the
attribute features, is first learned to estimate the classes of unlabeled instances. The algorithm then employs an
aggregation function to compute the relational features by counting the number of neighbors with respect to each
label. Once the features are constructed, a collective classifier, M_AR, is learned using both the attribute
features and relational features (Figure 1(b)). The algorithm then repeats step c and step d to make new
predictions for unlabeled instances (Figure 1(c)) and to update the relational features based on the newly generated
predictions (Figure 1(d)). ICA-type algorithms usually assume a separate training graph with abundant labeled
data. However, in many applications such as protein function prediction, the number of labeled proteins is very
limited and labels are very expensive to obtain. In this situation, most data have no connection to labeled data,
and supervision knowledge cannot be obtained from the local connections (as illustrated in Figure 1(a)). As a
result, the collective classifier M_AR learned from these networks may suffer a reduction in classification
performance.

This paper describes an effective Markov chain based CC algorithm (ICAM) to tackle the label deficiency problem in
CC for protein function prediction from PPI networks. Our idea is to model the classifier M_AR via the Markov
chain with restart. The Markov chain model computes the ranks of labels to indicate the importance of a set of
labels to an instance by propagating the label information in a graph constructed from labeled and unlabeled data.
The ICAM algorithm further refines the Markov chain model using an ICA framework to generate the possible labels
for a given instance. With these techniques, M_AR can be learned more effectively. Experiments on the real-world
Yeast PPI datasets demonstrate that our proposed ICAM method improves the classification performance when
compared with the ICA-type CC methods. The main contributions of this paper are as follows.

• We study the label deficiency problem of collective 
classification (CC) and show that the protein func- 
tion prediction problem from PPI networks can be 
formulated as a CC task. 

• We extend the ICA-type CC algorithm and pro- 
pose the ICAM algorithm to leverage the unlabeled 
portion of the data to improve the classification per- 
formance of CC via the Markov chain with restart. 

• We demonstrate the effectiveness of our proposed ICAM algorithm using the Yeast benchmark datasets. We find
that ICAM leads to significant accuracy gains compared to other ICA-type methods when there are limited
numbers of labeled data available.

Figure 1 An example of ICA algorithm learning with limited labeled data. (a) Initial state: train classifier M_A
and classify V^U; (b) compute relational features X^R, train classifier M_AR; (c) re-predict V^U (using M_AR);
(d) re-compute relational features X^R. ICA repeats step c and step d until a fixed iteration number.
[Panel annotations: relational features <2, 2> and <1, 3>.]

Methods 

Preliminaries 

Assume that the PPI network data are represented as a graph G(V, E, X^A, Y, c), where V is a set of nodes and E is
a set of edges representing the interactions between the instances. Each instance/node v_i ∈ V is described by an
attribute vector x_i ∈ X^A. Each Y_i ∈ Y is a set of labels for v_i, and c is the number of possible labels. Assume
that we have a set of labeled nodes V^K ⊂ V with known labels Y^K = {Y_i | v_i ∈ V^K}, and the task is to predict
the labels Y^U for unlabeled nodes V^U = V - V^K. In this paper, we are primarily interested in generating a
ranking of possible labels for a given protein such that its correct functions receive a higher ranking than the
less likely ones.

The ICAM algorithm 

Inspired by the ICA approach, we introduce the ICAM algorithm for collective classification. The algorithm is
summarized in Algorithm 1. Similar to the ICA framework, the ICAM algorithm has two parts: bootstrap and
iterative inference. The bootstrap part learns an attribute-only classifier M_A from the known nodes, and uses
M_A to predict labels for the unknown nodes V^U (steps 1-2). In the iterative inference part, the relational
features X^R are updated based on the estimated class labels of the data (step 4). Specifically, X^R of the
(i + 1)-th iteration is based on the known and predicted labels from the i-th iteration. Next, the algorithm trains
a collective classifier M_AR using both attribute features X^A and relational features X^R to compute the labels
for unlabeled data. The iterative process stops when the predictions of M_AR have stabilized or a fixed number of
iterations is reached.

An important component of the ICA algorithm is to build the relational features that summarize the relational
information, and to construct new feature vectors to train the classifier M_AR. For instance, Neville et al. [15]
summarize the labels of neighboring nodes as relational features, as illustrated in Figure 1(b), where node "B" has
two positive neighboring nodes and two negative neighboring nodes. Here, the relational features are <2, 2>, and
<2, 2> is then appended onto the original feature vector, yielding the new features <x_i1, x_i2, ..., 2, 2>.
ICA-type CC methods usually increase accuracy for network data by using a large amount of labeled data to train
M_AR. In this scenario, the supervision knowledge can be effectively propagated in the network and improve the
learning accuracy [18]. However, labeled data are time-consuming to obtain and their number is very limited. Most
of the nodes may not link to any labeled node, as illustrated in Figure 1(a). As a result, the prediction accuracy
of the collective classifier M_AR decreases greatly.
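The count-based aggregation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `count_aggregation`, the toy graph and the attribute values are all hypothetical, and each node is assumed to carry a single label for simplicity.

```python
from collections import Counter


def count_aggregation(neighbors, labels, classes):
    """For every node, count how many of its neighbors carry each class label."""
    features = {}
    for node, nbrs in neighbors.items():
        # Neighbors without a (known or predicted) label are skipped.
        counts = Counter(labels[n] for n in nbrs if n in labels)
        features[node] = [counts.get(c, 0) for c in classes]
    return features


# Hypothetical toy graph: node "B" has two positive and two negative neighbors.
neighbors = {"B": ["A", "C", "D", "E"]}
labels = {"A": "+", "C": "+", "D": "-", "E": "-"}
relational = count_aggregation(neighbors, labels, classes=["+", "-"])

attributes_B = [0.3, 0.7]  # made-up attribute values for node "B"
augmented_B = attributes_B + relational["B"]  # attribute features followed by <2, 2>
```

In an ICA-style loop, `labels` would hold the known labels together with the current predictions, so the counts change from one iteration to the next.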

Algorithm 1 ICAM (V, E, X^A, Y^K, n)

Input:

V = nodes, E = edges, X^A = attribute feature vectors,
Y^K = labels of known nodes, n = # of iterations.
Procedure:

1: M_A = learnClassifier(X^A, Y^K);

2: Y^U = predict(M_A, X^A);

3: for t = 0 to n do

4: X^R = aggregation(V, E, Y^K ∪ Y^U);

5: Re-train M_AR = learnClassifier(X^A, X^R, Y^K);

6: Y^U = predict(M_AR, V, E, X^A, X^R, Y^K);

7: end for

8: return Y^U



In our ICAM algorithm, we assume that the attribute features x_i and the relational features r_i are conditionally
independent given the class label Y_i [19]. We then use two distinct classifiers to make two separate predictions
for attribute features X^A and relational features X^R. The prediction is given as follows:

$$
\begin{aligned}
p(Y_i \mid x_i, r_i) &= \frac{p(x_i \mid Y_i)\, p(r_i \mid Y_i)\, p(Y_i)}{p(x_i, r_i)} \\
&= \frac{p(Y_i \mid x_i)\, p(x_i)}{p(Y_i)} \cdot \frac{p(Y_i \mid r_i)\, p(r_i)}{p(Y_i)} \cdot \frac{p(Y_i)}{p(x_i, r_i)} \\
&= \gamma\, \frac{p(Y_i \mid x_i)\, p(Y_i \mid r_i)}{p(Y_i)}
\end{aligned}
$$

where γ = p(x_i) p(r_i) / p(x_i, r_i) is a constant independent of Y_i. The attribute classifier that estimates
p(Y_i | x_i) is referred to as M_A, and the relational classifier that estimates p(Y_i | r_i) is referred to
as M_R.
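As a small numerical illustration of this combination rule, the sketch below multiplies the two classifier outputs, divides by the class prior, and renormalizes. The function name `combine_posteriors` and the probability values are hypothetical, and normalizing over mutually exclusive classes is an assumption made for the example (it plays the role of the constant γ); the paper itself ranks labels in a multi-label setting.

```python
def combine_posteriors(p_attr, p_rel, prior):
    """Combine p(Y|x) and p(Y|r) as p(Y|x) * p(Y|r) / p(Y), then normalize."""
    scores = [a * r / p for a, r, p in zip(p_attr, p_rel, prior)]
    total = sum(scores)  # dividing by the total absorbs the constant gamma
    return [s / total for s in scores]


# Made-up outputs of the attribute classifier M_A and relational classifier M_R.
p_attr = [0.6, 0.4]
p_rel = [0.7, 0.3]
prior = [0.5, 0.5]
combined = combine_posteriors(p_attr, p_rel, prior)
```

With these numbers the unnormalized scores are 0.84 and 0.24, so the first class receives the higher combined score.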

There are two main advantages of this prediction method. First, this method allows us to train classifiers
M_A and M_R for attribute features X^A and relational features X^R in parallel. Second, in the collective
inference process, the classifier M_R can be re-trained in each iteration based on the re-constructed relational
features X^R to improve the prediction accuracy of the collective classifier M_AR.

Various traditional supervised learning methods can be used to train M_A and M_R, where the classifier, such as
kNN, naive Bayes or logistic regression [16,20], is learned from separate training data with a large amount of
labeled data. However, when dealing with the label deficiency problem in PPI networks, we propose to use a
transductive learning method that acquires additional information from unlabeled data to improve the
classification performance. Specifically, we set up Markov chain based learning models to estimate p(Y_i | x_i)
and p(Y_i | r_i).

Markov chain based learning

The Markov chain based learning model is based on the idea of random walk with restart. We note that there are
many learning tasks using random walk techniques, such as protein network cluster discovery [21], community
discovery [22], multi-instance multi-label learning [23], and transfer learning [24]. The idea of random walk with
restart is to consider a random walker that starts from labeled nodes and iteratively moves to its neighbors
with probabilities proportional to their edge weights. At each step, the walker returns to a labeled node with
probability α (0 < α < 1). The steady-state probability that the walker will finally stay at node i is the
relevance score of node i with respect to the labeled nodes [25]:

$$ u = (1 - \alpha) P u + \alpha q $$

where u = [u_i] is the vector of steady-state probabilities (relevance scores) of the different nodes, P is the
affinity matrix associated with the instances in the Markov chain transition probability graph, and q is the label
distribution vector whose entries are 1 for labeled instances and 0 for others. Here, the steady-state probability
(relevance score of the instances) captures the global structure of the graph and the relationships between the
nodes. The advantage of this random walk procedure is that it converges to a unique solution for any initial
u(0). The process converges fast, needing just a few iterations. The random walk and related methods have been
shown to have good performance on the learning tasks mentioned above. In the following, we introduce the
learning of M_A via the Markov chain with restart using all the instances (both labeled and unlabeled).
The process of learning M_R is similar.
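As a side note that is not part of the original text, the restart equation can also be rearranged into a closed form. The derivation below assumes that P is column-stochastic, so that the spectral radius of (1 - α)P is 1 - α < 1 and the inverse exists; I denotes the identity matrix.

$$
u = (1 - \alpha) P u + \alpha q
\;\Longleftrightarrow\;
\bigl(I - (1 - \alpha) P\bigr) u = \alpha q
\;\Longleftrightarrow\;
u = \alpha \bigl(I - (1 - \alpha) P\bigr)^{-1} q
  = \alpha \sum_{k=0}^{\infty} (1 - \alpha)^k P^k q .
$$

The series form also makes the convergence claim concrete: the contribution of the k-th step of the walk is damped by the factor (1 - α)^k.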

Given the constructed attribute feature vector x_i ∈ X^A for each node v_i ∈ V, the pairwise affinities
A ∈ R^{m×m} between the nodes based on relational information are computed using the Gaussian kernel function as
follows:

$$ A_{ij} = \exp\left( -\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2} \right) \tag{1} $$

where ||x_i - x_j|| is the Euclidean distance between the i-th feature vector and the j-th feature vector in X^A.
The parameter σ is a positive number that controls the linkage in the manifold [26]. The m-by-m matrix A, with its
(i, j)-th entry given by A_ij, is always nonnegative. Similarly to (1), applying the Gaussian kernel to r_i ∈ X^R
leads to the affinity matrix R for relational features. We then set up Markov chain models for classifiers M_A and
M_R based on A and R, respectively.
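The Gaussian kernel affinity in (1) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the function name `gaussian_affinity`, the toy feature vectors and the value sigma = 1.0 are hypothetical, and the squared Euclidean distance in the exponent follows the usual form of the kernel.

```python
import math


def gaussian_affinity(X, sigma):
    """Return the m-by-m matrix with entries exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    m = len(X)
    A = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            squared_distance = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            A[i][j] = math.exp(-squared_distance / (2.0 * sigma ** 2))
    return A


# Three made-up two-dimensional feature vectors.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
A = gaussian_affinity(X, sigma=1.0)
```

The resulting matrix is symmetric, has ones on its diagonal, and every entry lies in (0, 1], which is why A is always nonnegative.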

For the classifier M_A, we construct an m-by-m Markov transition probability matrix P by normalizing the entries
of A with respect to each column, i.e., each column sum of P is equal to one: Σ_{i=1}^{m} [P]_ij = 1. With such a
P, we model the probabilities of visiting the other instances from the current instance in a Markov chain
transition probability graph. In this transition probability graph, all the labeled and unlabeled instances are
linked together. Intuitively, a random walker starts from nodes with known labels and propagates labels from the
labeled instances to the unlabeled instances. The walker iteratively visits its neighborhood of nodes according to
the transition probability graph based on A.
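The column normalization that turns an affinity matrix into a transition matrix can be sketched as below. The function name `column_normalize` and the toy matrix are hypothetical; the sketch assumes every column has a positive sum, which holds for a Gaussian kernel matrix because its diagonal entries equal one.

```python
def column_normalize(A):
    """Divide every entry by its column sum so that each column of the result sums to 1."""
    m = len(A)
    column_sums = [sum(A[i][j] for i in range(m)) for j in range(m)]
    return [[A[i][j] / column_sums[j] for j in range(m)] for i in range(m)]


# Made-up symmetric affinity matrix for three instances.
A = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.5],
     [0.0, 0.5, 1.0]]
P = column_normalize(A)
```

Entry P[i][j] can then be read as the probability of moving to instance i from instance j.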

Next we use the idea in topic-sensitive PageRank [27] as a Markov chain with restart [25] to solve the learning
problem. The random walker has a probability of α to return to labeled instances at each step. This can be
interpreted as follows: during each iteration, each instance receives label information from its neighbors via
the random walk, and also retains its initial label information. The parameter α specifies the relative amount of
the information from its neighbors and its initial label information. Using this approach, we compute the
steady-state probabilities that the random walker finally stays at different instances. These steady-state
probabilities give a ranking of labels that indicates the importance of a set of labels to an unlabeled instance.

More formally, we adopt the following equation:

$$ U = (1 - \alpha) P U + \alpha Q \tag{2} $$

to compute the steady-state probabilities U = [u_1, u_2, ..., u_c] (an m-by-c matrix) according to P and
Q = [q_1, q_2, ..., q_c] (an m-by-c matrix), where each q_d is the assigned probability distribution vector of
class label d constructed from the labeled data. The restart parameter 0 < α < 1 controls the importance of the
assigned probability distributions in the labeled data to the resulting label ranking scores of instances. Given
the training data, one simple way to construct q_d is to use a uniform distribution on the instances with label
class d, so that the entries of q_d sum to 1. More precisely,

$$ [q_d]_i = \begin{cases} 1/l_d, & \text{if } d \in Y_i \\ 0, & \text{otherwise} \end{cases} \tag{3} $$

where l_d is the number of instances with the label class d in the training data.
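The construction of Q in (3) can be illustrated with a short sketch. The function name `build_restart_matrix`, the dictionary representation of the labeled nodes and the toy data are assumptions made for the example; classes and nodes are indexed from 0.

```python
def build_restart_matrix(label_sets, num_nodes, num_classes):
    """Column d is uniform over the labeled nodes carrying class d and zero elsewhere."""
    Q = [[0.0] * num_classes for _ in range(num_nodes)]
    for d in range(num_classes):
        members = [i for i, labs in label_sets.items() if d in labs]
        for i in members:
            Q[i][d] = 1.0 / len(members)  # len(members) plays the role of l_d
    return Q


# Hypothetical training labels: nodes 0, 1 and 3 are labeled, node 2 is unlabeled.
label_sets = {0: {0}, 1: {0, 1}, 3: {1}}
Q = build_restart_matrix(label_sets, num_nodes=4, num_classes=2)
```

A class with no labeled instance would leave an all-zero column, so in practice every class is assumed to occur at least once in the training data.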

The steady probability distribution matrix U is solved by the iteration method with an initial matrix U_0 where
each column is a probability distribution vector. The overall algorithm is summarized in Algorithm 2.

Algorithm 2 Markov Chain based Classifier
Input: P, Q, U_0, α, and the tolerance ε
Output: the steady probability distribution matrix U
Procedure:

1: Set t = 1;

2: Compute U_t = (1 - α) P U_{t-1} + α Q;

3: If ||U_t - U_{t-1}|| < ε, then stop and set U = U_t; otherwise set t = t + 1 and go to Step 2.
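Algorithm 2 can be sketched in plain Python as shown below. This is a hedged illustration rather than the authors' code: the function name `markov_chain_scores`, the choice U_0 = Q, the maximum absolute difference used as the norm, the iteration cap and the toy matrices are all assumptions; only the update rule and the default α = 0.95 come from the text.

```python
def markov_chain_scores(P, Q, alpha=0.95, tol=1e-8, max_iter=1000):
    """Iterate U_t = (1 - alpha) * P * U_{t-1} + alpha * Q until the change is below tol."""
    m, c = len(Q), len(Q[0])
    U = [row[:] for row in Q]  # assumed U_0: the columns of Q are already distributions
    for _ in range(max_iter):
        U_next = [
            [(1.0 - alpha) * sum(P[i][k] * U[k][d] for k in range(m)) + alpha * Q[i][d]
             for d in range(c)]
            for i in range(m)
        ]
        # The max-abs difference stands in for the unspecified norm ||U_t - U_{t-1}||.
        change = max(abs(U_next[i][d] - U[i][d]) for i in range(m) for d in range(c))
        U = U_next
        if change < tol:
            break
    return U


# Toy column-stochastic transition matrix for a chain of three nodes.
P = [[0.5, 0.25, 0.0],
     [0.5, 0.50, 0.5],
     [0.0, 0.25, 0.5]]
# Node 0 is labeled with class 0, node 2 with class 1, node 1 is unlabeled.
Q = [[1.0, 0.0],
     [0.0, 0.0],
     [0.0, 1.0]]
U = markov_chain_scores(P, Q)
```

Row i of U holds the ranking scores of the c labels for instance i. In this symmetric toy example the unlabeled middle node receives equal scores for both classes.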

Experimental results

In this section, we compare the performance of the ICAM algorithm with other ICA-type collective classification
algorithms: ICA, Gibbs and ICML. We show that the proposed algorithm outperforms these algorithms given a
limited number of labeled training data.

KDD Cup 2001 data and baselines 

The first experiment is conducted on Yeast gene function prediction from KDD Cup 2001 [28]. The dataset
includes 1,243 genes and 1,806 interactions among pairs of genes whose encoded proteins physically interact
with one another. These interaction relationships are symmetric. The protein functions are autocorrelated in
this dataset, and a subset of the data has been withheld for testing. The task is to predict the functions of the
proteins encoded by the genes. There are 14 functions, and a protein can have one or several functions.

We compare our proposed method with the following three baseline learners:

1. ICA. The Iterative Classification Algorithm (ICA) proposed by Neville et al. [15] is one of the simplest and
most popular CC methods and is frequently used as a baseline for CC evaluation in previous studies. For the
multi-label problem, we transform it into multiple single-label prediction problems using the one-against-all
strategy and employ ICA to make predictions for each single-label problem.

2. Gibbs. This baseline is another ICA-type CC algorithm using the ICA iterative classification framework. In
each iteration, Gibbs re-samples the label of each node based on the estimated label distribution [16]. We also
use the one-against-all strategy to convert the multi-label problem into multiple single-label problems for the
Gibbs algorithm.

3. ICML. This baseline is a multi-label CC algorithm proposed by Kong et al. [29]. ICML extends the ICA
algorithm to multi-label problems by considering dependencies among the label set in the iteration process.

In the experiments, we use kNN as the node classifier for ICA, Gibbs and ICML. The parameter k was automatically
selected in the range of 10 to 30 at an increment of 5 using 3-fold cross validation on the training set. For the
proposed ICAM method, we learn the classifiers M_A and M_R using Markov chain based models to perform separate
predictions. We set the value of α in the Markov chain model to 0.95 as suggested in [23].

Evaluation criteria

We evaluate the performance of our proposed method with four multi-label evaluation measures: average
precision, coverage, ranking loss, and one-error. They are commonly used for evaluating multi-label learning
algorithms.



Consider a multi-label dataset D = {(x_i, Y_i) | 1 ≤ i ≤ m}, where x_i ∈ X is an instance and Y_i ⊆ 𝒴 is the set
of true labels of x_i, written as Y_i = (Y_i1, Y_i2, ..., Y_ic) ∈ {0, 1}^c. Here x_i belongs to the j-th label
when Y_ij = 1, otherwise Y_ij = 0, and c is the number of possible labels. The evaluation measures are defined
using the following two outputs provided by the learning algorithms: s(x_i, l) returns a real value that indicates
the confidence for the class label l to be a proper label of x_i; rank_s(x_i, l) returns the rank of class
label l derived from s(x_i, l).

Coverage [30] evaluates how far we need, on average, to go down the list of labels in order to cover all
the true labels of an instance:

$$ \mathrm{coverage}(s) = \frac{1}{m} \sum_{i=1}^{m} \max_{l \in Y_i} \mathrm{rank}_s(x_i, l) - 1. $$
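Coverage can be computed directly from the confidence scores, as in the sketch below. The helper names `ranks_from_scores` and `coverage`, the convention that rank 1 is the highest score, and the toy scores are assumptions made for illustration; labels are indexed from 0.

```python
def ranks_from_scores(scores):
    """Rank 1 goes to the label with the highest score (ties broken by label index)."""
    order = sorted(range(len(scores)), key=lambda l: -scores[l])
    ranks = [0] * len(scores)
    for position, label in enumerate(order, start=1):
        ranks[label] = position
    return ranks


def coverage(score_rows, true_label_sets):
    """Average over instances of (worst rank among the true labels) minus one."""
    total = 0
    for scores, true_labels in zip(score_rows, true_label_sets):
        ranks = ranks_from_scores(scores)
        total += max(ranks[l] for l in true_labels) - 1
    return total / len(score_rows)


# Two made-up instances with three candidate labels each.
score_rows = [[0.9, 0.1, 0.5], [0.2, 0.7, 0.6]]
true_label_sets = [{0, 2}, {2}]
cov = coverage(score_rows, true_label_sets)
```

In both toy instances the lowest-ranked true label sits at rank 2, so one step down the list beyond the top label is needed on average.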

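Ranking loss [30] evaluates the average fraction of label pairs that are reversely ordered for the instance:

$$ \mathrm{rloss}(s) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{|Y_i|\,|\bar{Y}_i|}\, |R_i|, $$

$$ \text{where } R_i = \{ (l_1, l_2) \mid s(x_i, l_1) \le s(x_i, l_2),\ (l_1, l_2) \in Y_i \times \bar{Y}_i \} $$

and \bar{Y}_i denotes the complement of Y_i in the label set 𝒴.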
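A direct translation of the ranking loss into code might look as follows. The function name `ranking_loss` and the toy data are hypothetical, and the sketch assumes every instance has at least one true label and at least one label that is not a true label, so the denominator is never zero.

```python
def ranking_loss(score_rows, true_label_sets, num_labels):
    """Average fraction of (true, non-true) label pairs in which the true label scores no higher."""
    total = 0.0
    for scores, true_labels in zip(score_rows, true_label_sets):
        irrelevant = [l for l in range(num_labels) if l not in true_labels]
        reversed_pairs = sum(
            1 for l1 in true_labels for l2 in irrelevant if scores[l1] <= scores[l2]
        )
        total += reversed_pairs / (len(true_labels) * len(irrelevant))
    return total / len(score_rows)


# Two made-up instances with three candidate labels each.
score_rows = [[0.9, 0.1, 0.5], [0.2, 0.7, 0.6]]
true_label_sets = [{0, 2}, {2}]
rloss = ranking_loss(score_rows, true_label_sets, num_labels=3)
```

The first instance has no reversed pair; in the second, one of the two (true, non-true) pairs is reversed, giving 0.5 for that instance and 0.25 on average.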

One-error [30] evaluates how many times the top-ranked label is not in the set of true labels of the
instance. Define a classifier H that assigns a single label to an instance x_i by
H(x_i) = argmax_{l ∈ 𝒴} s(x_i, l); then the one-error is

$$ \text{one-error}(H) = \frac{1}{m} \sum_{i=1}^{m} [\![ H(x_i) \notin Y_i ]\!] $$
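One-error only inspects the single top-scoring label of each instance, which the following sketch makes explicit. The function name `one_error` and the toy data are hypothetical.

```python
def one_error(score_rows, true_label_sets):
    """Fraction of instances whose top-scoring label is not among the true labels."""
    misses = 0
    for scores, true_labels in zip(score_rows, true_label_sets):
        top_label = max(range(len(scores)), key=lambda l: scores[l])
        if top_label not in true_labels:
            misses += 1
    return misses / len(score_rows)


# Two made-up instances with three candidate labels each.
score_rows = [[0.9, 0.1, 0.5], [0.2, 0.7, 0.6]]
true_label_sets = [{0, 2}, {2}]
err = one_error(score_rows, true_label_sets)
```

The top label is correct for the first instance and wrong for the second, so the one-error is 0.5.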



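Average precision [30] evaluates the average fraction of labels ranked above a particular label l ∈ Y_i that are
also in Y_i:

$$ \mathrm{avgprec}(s) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{|Y_i|} \sum_{l \in Y_i} \frac{|P_i|}{\mathrm{rank}_s(x_i, l)} $$

$$ \text{where } P_i = \{ l' \in Y_i \mid \mathrm{rank}_s(x_i, l') \le \mathrm{rank}_s(x_i, l) \}. $$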
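The average precision formula can be sketched as below. The helper names, the convention that rank 1 is the highest score, and the toy data are assumptions made for illustration; the set of true labels ranked at or above label l is counted for each true label l separately.

```python
def ranks_from_scores(scores):
    """Rank 1 goes to the label with the highest score (ties broken by label index)."""
    order = sorted(range(len(scores)), key=lambda l: -scores[l])
    ranks = [0] * len(scores)
    for position, label in enumerate(order, start=1):
        ranks[label] = position
    return ranks


def average_precision(score_rows, true_label_sets):
    """For each true label, the share of labels at or above its rank that are also true."""
    total = 0.0
    for scores, true_labels in zip(score_rows, true_label_sets):
        ranks = ranks_from_scores(scores)
        per_instance = 0.0
        for l in true_labels:
            hits = sum(1 for l2 in true_labels if ranks[l2] <= ranks[l])
            per_instance += hits / ranks[l]
        total += per_instance / len(true_labels)
    return total / len(score_rows)


# Two made-up instances with three candidate labels each.
score_rows = [[0.9, 0.1, 0.5], [0.2, 0.7, 0.6]]
true_label_sets = [{0, 2}, {2}]
avgprec = average_precision(score_rows, true_label_sets)
```

Here the first instance scores 1.0 and the second 0.5, so the average precision is 0.75 and the quantity 1 - average precision is 0.25.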

The smaller the value of coverage, ranking loss and one-error, the better the performance. For average
precision, the bigger the value, the better the performance; we therefore report the results of
1 - average precision. Thus, for all evaluation metrics, the smaller the value, the better the performance.

Results on KDD Cup 2001 data 

In this experiment, we test the performance of our proposed ICAM algorithm on the KDD Cup 2001 dataset.
We randomly select 50% of the data as the training set, and use the remaining 50% as the test set. The experiment
is conducted 10 times with randomly selected training/test splits (each with a different random seed), and we
report the mean and standard deviation of each compared algorithm over the same 10 trials.

Table 1 shows the performance of each compared method on the Yeast protein dataset. The best performance
achieved among the compared algorithms is marked in bold. The results show the competitiveness of the ICAM
method against the other learning methods. Different evaluation measures of the learning performance are used
in the experiments. One algorithm rarely outperforms another algorithm in all criteria. For example, a method
that is optimal for instance ranking loss usually does not perform well in coverage or one-error [31]. In the
experiments, we find that ICAM is able to produce better results across all evaluation metrics. These results
suggest that the ICAM algorithm is a good collective classification method for protein function prediction.

We also test the performance of the compared algorithms with different numbers of labeled instances, ranging
from 200 to 1000 with an increment of 200. For example, we randomly pick 200 instances as training data and use
the remaining data for testing. The experiment is conducted 10 times with randomly selected training/test
splits. We report the mean and standard deviation of each compared algorithm. Figure 2 shows the performance of
ICAM and the other learning methods with respect to the number of labeled instances.

We can see from the figure that ICAM (the black line) has the best performance in general. ICAM outperforms
the other algorithms for different numbers of training data, especially when the size of the training data is
small. Specifically, when the number of training instances is 200, ICAM achieves a coverage improvement of 0.4916
over the second best method Gibbs (ICAM: 4.2213 versus Gibbs: 4.7129) and a 0.0439 improvement on ranking loss
(ICAM: 0.1184 versus ICML: 0.1623). As the size of the training data increases, ICAM consistently achieves better
performance than the other learning algorithms across all evaluation criteria.

We find that ICAM outperforms the other ICA-type methods substantially in terms of coverage. On the other
hand, ICAM only slightly outperforms the other methods in terms of one-error. We note that one-error and coverage
are two different quantitative measures. One-error evaluates how many times the top-ranked label is not in the
set of possible labels. Thus, if the goal of a prediction system is to assign a single function to a protein
(single-label classification), the one-error is identical to the test error. Coverage, in contrast, measures how
far we need, on average, to go down the list of labels in order to cover all the possible labels assigned to a
protein. Coverage is loosely related to precision at the level of perfect recall [30]. The experimental results
indicate that the top-ranked label predictions from the other ICA-type methods are about as accurate as those
from ICAM, but the predictions from ICAM are more complete than those of the other ICA-type methods. A reasonable
explanation for this finding is that the ICA-type methods focus on the single-label setting. In this case, the
multi-label problem is first transformed into multiple single-label prediction problems, and the ICA-type methods
then use independent classifiers induced from labeled training data for each problem. ICA-type approaches
therefore ignore the effect of unlabeled data and the interdependence of the protein functions. In contrast, our
proposed ICAM approach is based on a Markov chain based transductive learning method that uses both labeled and
unlabeled data for label propagation. The Markov chain based method takes the correlation of the classes into
account to effectively compute a ranking of labels for an instance. Therefore, ICAM provides an opportunity to
leverage the individual ICA-type classifiers to achieve higher coverage of predictions.

Results on KDD Cup 2002 data 

To validate the effectiveness of the proposed method when there are only a limited number of positive labeled
training data, we conduct additional experiments on a relatively large-scale Yeast dataset from KDD Cup 2002.
It consists of 4507 instances (i.e., genes) from experiments with a set of S. cerevisiae (Yeast) strains. Each
instance is described by various types of information that characterize the gene associated with the instance.
The data sources for describing the instances include abstracts from the scientific literature (MEDLINE), gene
localization and functions. We represent each instance by a feature vector with 20545 dimensions. Pairs of
genes whose encoded proteins physically interact with one another are linked; the resulting protein-protein
interaction network consists of 1218 links.

Table 1 The performance (mean ± standard deviation) of compared algorithms on the Yeast protein dataset.

Methods   Coverage        Ranking Loss    One-error       1 - Average Precision
ICA       4.217 ± 0.273   0.140 ± 0.013   0.042 ± 0.005   0.155 ± 0.005
Gibbs     4.319 ± 0.195   0.148 ± 0.008   0.043 ± 0.005   0.154 ± 0.006
ICML      4.409 ± 0.091   0.153 ± 0.006   0.043 ± 0.007   0.162 ± 0.006
ICAM      3.748 ± 0.164   0.100 ± 0.008   0.041 ± 0.005   0.151 ± 0.005

Figure 2 The performance of different algorithms on the Yeast protein dataset with varying number of labeled instances

Each instance is labeled with one of three class labels: "nc", "control" and "change". The "change" label indicates
instances in which the activity of the hidden system was significantly changed, but the activity of the control
system was not significantly changed. The goal of the KDD Cup 2002 task is to learn a model that can accurately
predict the genes that affect the hidden system but not the control system. In this case, the positive class
consists of those genes with the "change" label and the negative class consists of those genes with either the
"nc" or the "control" label. This partition is highly imbalanced: the rate of positive instances is only 1.2%.
Therefore, we base our evaluation analysis on Receiver Operating Characteristic (ROC) curves, which reflect the
true positive rate of a classifier as a function of its false positive rate. ROC curves are commonly used for
evaluating highly skewed binary classification problems. A recent study has shown that ROC curves have a deep
connection to precision-recall (PR) curves [32].
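A ROC curve is commonly summarized by the area under it (AUC), which equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes the AUC through this pairwise interpretation; the function name `roc_auc`, the tie handling (half credit) and the toy scores are assumptions made for illustration and are not taken from the paper's evaluation code.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up classifier scores with two positive and three negative instances.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
```

Five of the six positive-negative pairs are ordered correctly here, so the AUC is 5/6. The quadratic pairwise loop is chosen for clarity; a sort-based computation would be preferable for thousands of instances.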

To evaluate the performance of our ICAM algorithm, we compare it with the linear kernel SVM method as
implemented in LIBSVM [33]. Figure 3 shows the ROC curves on the KDD Cup 2002 task for ICAM and SVM. The x-axis
and y-axis of the figure refer to the false positive rate and the true positive rate, respectively. We see
from the figure that our ICAM (the red curve) outperforms the SVM method (the blue curve) in general.
ICAM achieves an improvement of 10.0 percentage points (0.713 versus 0.613) in the area under the ROC curve. The
experimental results imply that the proposed ICAM method is able to deliver better performance when the number of
positive labeled data is small.

Figure 3 ROC curve of baseline SVM and our ICAM method

Table 2 The description of experimental datasets used in the experiments on collaboration networks.

Datasets   Number of Instances   Number of Attributes   Number of Links   Number of Classes
DBLP-A     23,806                12,588                 150,042           6
DBLP-B     16,020                8,595                  95,108            6

Experiments on collaboration networks 

In this section, we compare the performance of the proposed ICAM algorithm with other collective classification
algorithms on two multi-label collaboration network datasets to validate the effectiveness of the proposed method
more thoroughly. These collaboration network datasets are collected from the DBLP computer science bibliography
website, and were used in prior work to study multi-label collective classification problems [29]. Their
characteristics are listed in Table 2. Specifically, we extract DBLP coauthorship networks that contain authors
who have published papers during the years 2000-2010 as the nodes of the networks, and link any two authors who
have collaborated with each other. At each node, we extract a bag-of-words representation of all the paper titles
published by the author, and use it as the attributes of the node. Each author has one or multiple research
topics of interest from 6 research areas. The representative conferences from each area are selected as class
labels. If an author has published papers in any of these conferences, we assume the author is interested in the
corresponding research class. The task is to classify each author with a set of multiple research classes of
interest. The conferences corresponding to the class labels of the two datasets (DBLP-A and DBLP-B)
are given as follows.
The classes of DBLP-A are as follows: 

1 Database: ICDE, VLDB, SIGMOD, PODS, EDBT 

2 Data Mining: KDD, ICDM, SDM, PKDD, PAKDD 

3 Artificial Intelligence: IJCAI, AAAI 

4 Information Retrieval: SIGIR, ECIR 

5 Computer Vision: CVPR 

6 Machine Learning: ICML, ECML 

The classes of DBLP-B are as follows: 

1 Algorithms & Theory: STOC, FOCS, SODA, COLT 

2 Natural Language Processing: ACL, ANLP, 
COLING 

3 Bioinformatics: ISMB, RECOMB 

4 Networking: SIGCOMM, MOBICOM, INFOCOM 

5 Operating Systems: SOSP, OSDI 

6 Parallel Computing: POD, ICS 

We test ICAM and the other ICA-type algorithms with different numbers of labeled instances, from 1000 to 5000
with an increment of 500. The average results as well as the standard deviations over 10 data splits are given in
Figure 4. The experimental results are consistent with our previous study. We observe that ICAM consistently
outperforms the other ICA-type methods on these two datasets, especially when there are only a limited number of
labeled instances; that is, a larger (smaller) improvement is obtained with fewer (more) labeled data.

Figure 4 The coverage performance of different algorithms with varying number of labeled instances: (a) DBLP-A
dataset; (b) DBLP-B dataset.

Conclusion 

In this paper, we studied the label deficiency problem in collective classification (CC). We showed that the
protein function prediction problem from PPI networks can be formulated as a CC problem, and proposed an effective
and novel Markov chain based CC learning algorithm, namely ICAM. It focuses on how to use labeled and unlabeled
data to enhance the classification performance on PPI network data. Experimental results on two real-world
Yeast PPI network datasets and two collaboration network datasets showed that our proposed ICAM method
is effective in learning CC tasks when labeled data are scarce. In the future, we will consider other
semi-supervised learning techniques for collective classification in PPI network data, and we will also study
other complex biological networks, for example through heterogeneous network classification.